Significance

Neil Lund

2025-01-06

Uncertainty and model comparison

Assume we have the correct model (this is a big assumption!)

  • How should we talk about things like uncertainty and error?

  • How do we make an argument for one model over another?

Uncertainty and model comparison

  • P-values

  • Confidence Intervals

  • Frequentist alternatives

  • Bayesian alternatives

P-values

What does a p-value mean?

Can this British lady do wild stuff with her tea?

Muriel Bristol

Fisher’s proposed experiment:

  • 8 cups of tea. 4 milk first, and 4 tea first

  • Each cup presented in random order, and she’s asked to classify each one

  • The “null hypothesis”: she’s just guessing

What does a p-value mean?

The p-value is “the probability of correctly classifying at least N cups by guessing”

  • At least 1 correct \(= 69/70 = 99\%\)

  • At least 2 correct \(= 53/70 = 76\%\)

  • At least 3 correct \(= 17/70 = 24\%\)

  • 4 correct \(= 1/70 = 1.43\%\)
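The counts above follow a hypergeometric distribution (4 milk-first cups among 8, with 4 selections). A quick check in R:

```r
# P(exactly k correct) when 4 of 8 cups are milk-first and she labels 4 as milk-first
p_exact <- dhyper(0:4, m = 4, n = 4, k = 4)

sum(p_exact[2:5])  # at least 1 correct: 69/70, about 0.986
sum(p_exact[4:5])  # at least 3 correct: 17/70, about 0.243
p_exact[5]         # exactly 4 correct: 1/70, about 0.0143
```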

What does a p-value mean?

  • In the broadest sense, a p-value is the probability of getting a result greater than, less than, or larger in absolute value than the one in your sample, if the null hypothesis is true.

    • Note, this is a statement about the probability of data given a model, not a statement about the probability of a model.
  • In linear regression, p-values are typically calculated from the t distribution (which converges to the normal as the sample grows)

One tail or two

  • Pr(>|t|) = .225: a 22.5% chance of seeing a result as large in absolute value as our sample statistic if the null hypothesis is true

  • Pr(>t) = .04: a 4% chance of seeing a result greater than the sample statistic if the null hypothesis is true

  • Pr(<t) = .95: a 95% chance of seeing a result less than the sample statistic if the null hypothesis is true
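In R, these quantities are just different tails of the t distribution. A sketch with a hypothetical t statistic of 1.8 on 50 degrees of freedom (not the numbers above):

```r
t_stat <- 1.8; df <- 50              # hypothetical values for illustration
2 * pt(-abs(t_stat), df)             # Pr(>|t|): two-tailed p-value
pt(t_stat, df, lower.tail = FALSE)   # Pr(>t): upper one-tailed p-value
pt(t_stat, df)                       # Pr(<t): lower one-tailed p-value
```

The two-tailed value is just the sum of the two tail areas at plus and minus t.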

P-values and Type I error

Testing at the .05 level means that when the null hypothesis is true, around 1 in 20 tests will produce a false positive

What does a p-value not mean?

  • \(P(t | h_0) \neq P(h_1)\): p-values are not the probability of the alternative hypothesis
  • Significance does not mean the null is false
  • Statistical significance doesn’t mean practical significance

P-values: caveats

  • P1. p-values can indicate how incompatible the data are with a specified statistical model.
  • P2. p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  • P3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  • P4. Proper inference requires full reporting and transparency.
  • P5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  • P6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

From: McShane, B. B., & Gal, D. (2017). Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association, 112(519), 885-895.

Problems with P-values

  • Often misinterpreted, even by experts

  • They’re continuous, but they tend to be treated as binary and conflated with importance

  • In large samples they become infinitesimally small, so they are hard to read and even trivial effects can look significant
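To see the large-sample problem, simulated data with a practically negligible slope still produce a vanishing p-value (the data and effect size here are made up for illustration):

```r
set.seed(7)
n <- 1e5
x <- rnorm(n)
y <- 0.05 * x + rnorm(n)  # a tiny effect: R-squared around 0.0025
summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]  # minuscule despite the trivial effect
```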

Confidence intervals

What do confidence intervals mean?

In repeated sampling, 95% of the 95% confidence intervals will contain the correct value of \(\beta\)

Confidence Intervals

  • It’s misleading to say there’s a 95% chance that the population \(\beta\) falls between the CI boundaries (frequentist statistics assumes that \(\beta\) is fixed, so it doesn’t have a “chance” of doing anything)

  • Better to say “I’m 95% certain this range contains the true value”
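The “repeated sampling” definition can be checked by simulation; a sketch with a made-up data-generating process where the true \(\beta = 2\):

```r
set.seed(42)
# Draw 1000 samples, fit the model each time, and count how often
# the 95% CI for the slope covers the true beta = 2
covered <- replicate(1000, {
  x <- rnorm(50)
  y <- 1 + 2 * x + rnorm(50)
  ci <- confint(lm(y ~ x))["x", ]
  ci[1] <= 2 && 2 <= ci[2]
})
mean(covered)  # close to 0.95
```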

Confidence Intervals

  • Pro: P-values and confidence intervals are based on the same basic assumptions, but CIs can be more informative about uncertainty and don’t lend themselves to the same dichotomous interpretation

  • Pro: Automatically scaled to the quantity of interest, so harder to confuse with a measure of substantive importance

  • Pro: can be used for equivalence testing (we’ll come back to this)

  • Con: also commonly misinterpreted!

Alternatives: S-Values

One proposed alternative: S-values, AKA surprisal (Shannon information)

\[\text{S-value} = -\log_2(p)\]

  • Intuitive interpretation: an S-value of X says the data are as surprising as flipping a fair coin X times and having every flip land heads. So S = 4 is as surprising as 4 heads in a row.

Increasing values as evidence gets stronger, and the sizes are less extreme:

P-value        S-value
0.9            0.15
0.06           4.05
0.05           4.32
0.01           6.64
0.001          10
0.000000001    30
1e-100         332
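The table is just a base-2 log transform of the p-value, easy to reproduce:

```r
# S-value (surprisal) transformation of a p-value
s_value <- function(p) -log2(p)
s_value(c(0.05, 0.01, 0.001, 1e-9))  # about 4.32, 6.64, 9.97, 29.9
```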

Alternatives: S-Values

  • Pro: Still based on the same assumptions as a p-value; it’s just a transformation of the p-value that might be more intuitive for people.

  • Pro: easy calculation (for a computer, at least)

  • Con: Not widely adopted. All of this stuff relies on convention.

Bayesian methods

  • Bayesian methods potentially allow us to do the thing we wish we could do with p-values: quantify hypothesis probabilities

  • Downside is that it sort of requires everyone to adopt a totally different philosophical approach to probabilities.

Bayes Factors Approximation

Bayes Factors measure the weight of evidence in favor of one hypothesis relative to another.

Wagenmakers (2007) suggests a frequentist approximation of a Bayes Factor using the Bayesian Information Criterion.

  • Run a model with the variable of interest (model 1)

  • Run a model without the variable of interest (model 0)

  • Calculate the BIC for both

  • Calculate \(\frac{\exp(-\text{BIC}_1/2)}{\exp(-\text{BIC}_0/2)}\)

library(readr)  # for read_csv

ff <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/causaldata/gov_transfers.csv")

model_0 <- lm(Support ~ Income_Centered + Education + Age, data = ff)

model_1 <- lm(Support ~ Participation + Income_Centered + Education + Age, data = ff)

exp(-BIC(model_1)/2) / exp(-BIC(model_0)/2)
[1] 7.509412

Numbers greater than 1 indicate evidence in favor of model 1, numbers less than 1 indicate evidence against model 1.

Comparing Models

  • If I have multiple models, how do I tell whether one is better than the other?

“Good” models can be bad predictors

R-squared

R-squared can be used for model comparison, but it doesn’t tell you a model is a good predictor.

Comparing models

  • Make a fair comparison: our “restricted” model \(Y_i = \bar{Y} + e_i\)
  • Our “unrestricted” comparison model: \(Y_i = \alpha + \beta_1X + e_i\)
  • How much can we improve relative to the baseline? Stated differently: our null hypothesis is that the restricted model is just as good as our new one.

Comparing models

Predicting Democratic voteshare using the 2016 share

state        demshare   demshare_2016   median_age
Alabama      0.371      0.356           39.7
Alaska       0.447      0.416           35.6
Arizona      0.502      0.481           38.9
Arkansas     0.358      0.357           38.8
California   0.649      0.661           37.5
Colorado     0.569      0.527           37.2

Comparing models

\[ F = \frac{(RSS_{restricted} - RSS_{full})}{df_{restricted} - df_{full}} \div \frac{RSS_{full}}{df_{full}} \]

model_0 <- lm(demshare~ 1, data=voteshares)

model_1 <-lm(demshare ~ 1 + demshare_2016, data=voteshares)

anova(model_0, model_1)
Res.Df   RSS       Df   Sum of Sq   F         Pr(>F)
50       0.758
49       0.00896   1    0.749       4.1e+03   6.81e-49
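The printed F statistic can be recovered by hand from the formula above, using the RSS and df values in the table:

```r
# Recompute the F statistic from the printed anova() output
rss_r <- 0.758;   df_r <- 50  # restricted (intercept-only) model
rss_f <- 0.00896; df_f <- 49  # full model
F_stat <- ((rss_r - rss_f) / (df_r - df_f)) / (rss_f / df_f)
F_stat  # about 4.1e+03, matching the table
```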

Comparing models

We can continue to include additional covariates in the model. But the “fair comparison” here is probably the same model minus one covariate (rather than comparing to the null model)

Res.Df   RSS       Df   Sum of Sq   F          Pr(>F)
50       0.758
49       0.00896   1    0.749       4.04e+03   5.31e-48
48       0.00889   1    6.62e-05    0.357      0.553

BIC

The F-test requires nested models, but the BIC and AIC allow comparisons across non-nested forms

\[\mathrm{BIC} = n\ln(RSS/n) + k\ln(n)\]

BIC(model_1)
[1] -284.4724
BIC(model_2)
[1] -280.919
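The formula differs from R’s `BIC()` only by an additive constant, so differences between models agree exactly. A check on simulated data (the data here are made up; k counts the coefficients plus the error variance, matching R’s parameter count):

```r
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 + rnorm(100)
m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x1 + x2, data = d)

n <- nrow(d)
bic_formula <- function(m) {
  k <- length(coef(m)) + 1  # coefficients plus the error variance
  n * log(sum(resid(m)^2) / n) + k * log(n)
}
bic_formula(m2) - bic_formula(m1)  # identical to BIC(m2) - BIC(m1)
```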

Out of sample prediction

  • Basic method: set aside some observations, create model on the training set and then predict the test set

  • For small samples, k-fold or leave-one-out cross-validation (LOOCV) is preferable: hold out one fold (or one observation), predict it, repeat for each, and then average the metric

library(caret)
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
cv_1 <- train(demshare ~ demshare_2016, data = voteshares, 
               method = "lm",
               trControl = train.control)

Out of sample prediction

Linear Regression 

51 samples
 1 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 50, 50, 50, 50, 50, 50, ... 
Resampling results:

  RMSE        Rsquared   MAE       
  0.01380351  0.9871893  0.01096933

Tuning parameter 'intercept' was held constant at a value of TRUE

Out of sample prediction

…but is this the right metric?

Arguing for a null

Conventional hypothesis testing “stacks the deck” in favor of finding no relationship, and failure to reject the null could simply be a function of small sample sizes.

Still, we often want to dismiss a claim:

  • Can citizen pressure campaigns make states more effective?

  • Do motor voter laws improve turnout?

  • Are some people actually psychic?

TOST and equivalence testing

“Two one-sided tests”

Step 1: Set a minimum effect size (\(\Delta\)) worth caring about

Step 2: Conduct a one-sided t-test of each null:

  • H01: the true effect \(\theta \leq -\Delta\)

  • H02: the true effect \(\theta \geq \Delta\)

If both nulls are rejected, we can conclude the true effect is smaller in magnitude than the minimum effect size.

TOST and equivalence testing

  • Eskine (2013) found that people exposed to organic foods became morally judgmental.

  • Moery and Calin-Jageman (2016) attempt to replicate this finding

https://www.nbcnews.com/news/world/does-organic-food-turn-people-jerks-flna779676

TOST and equivalence testing

Moery and Calin-Jageman’s replication results look like this:

Group       N    Mean   SD
Control     95   5.25   0.95
Treatment   89   5.22   0.83

TOST and equivalence testing

The estimated effect: 5.25 - 5.22 = 0.03

Minimal effect size \(\Delta = 0.43\)

What is the probability of seeing a mean difference greater than or equal to 0.03 if the true difference is 0.43?

What is the probability of seeing a mean difference less than or equal to 0.03 if the true difference is -0.43?
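Those two probabilities can be approximated from the reported summary statistics (a simple pooled-df sketch with an unpooled standard error; the published analysis may differ in details):

```r
m_c <- 5.25; sd_c <- 0.95; n_c <- 95  # control group
m_t <- 5.22; sd_t <- 0.83; n_t <- 89  # treatment group

est   <- m_c - m_t                         # observed difference: 0.03
se    <- sqrt(sd_c^2 / n_c + sd_t^2 / n_t)
delta <- 0.43                              # minimal effect size
df    <- n_c + n_t - 2

p_upper <- pt((est - delta) / se, df)                      # P(diff <= 0.03 | true diff = 0.43)
p_lower <- pt((est + delta) / se, df, lower.tail = FALSE)  # P(diff >= 0.03 | true diff = -0.43)
c(p_lower, p_upper)  # both well below .05: reject both one-sided nulls
```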

Negligible Effects (Rainey)

Null effects and Duverger’s law

Rainey:

  • “Hypothesis 1: Increasing the effective number of ethnic groups from the 10th percentile (1.06) to the 90th percentile (2.48) will not lead to a substantively meaningful change in the effective number of political parties when the district magnitude is one.”
  • “Hypothesis 2: Increasing the district magnitude from one to seven will not lead to a substantively meaningful change in the effective number of political parties when the effective number of ethnic groups is one.”

  • Substantively meaningful effect: changing the effective number of parties by 0.62 (where does he get that?)